---
title: Assemble structured custom models
description: DataRobot provides built-in support for a variety of libraries to create models that use conventional target types.

---

# Assemble structured custom models

DataRobot provides built-in support for a variety of libraries to create models that use conventional target types. If your model is based on one of these libraries, DataRobot expects your model artifact to have a matching file extension:

=== "Python libraries"

    | Library                      | File Extension | Example               |
    |------------------------------|----------------|-----------------------|
    | Scikit-learn                 | *.pkl          | sklearn-regressor.pkl |
    | XGBoost                      | *.pkl          | xgboost-regressor.pkl |
    | PyTorch                      | *.pth          | torch-regressor.pth   |
    | tf.keras (tensorflow>=2.2.1) | *.h5           | keras-regressor.h5    |
    | ONNX                         | *.onnx         | onnx-regressor.onnx   |
    | PMML                         | *.pmml         | pmml-regressor.pmml   |

=== "R libraries"

    | Library | File Extension | Example            |
    |---------|----------------|--------------------|
    | Caret   | *.rds          | brnn-regressor.rds |

=== "Java libraries"

    | Library                  | File Extension | Example                                      |
    |--------------------------|----------------|----------------------------------------------|
    | datarobot-prediction     | *.jar          | dr-regressor.jar                             |
    | h2o-genmodel             | *.java         | GBM_model_python_1589382591366_1.java (pojo) |
    | h2o-genmodel             | *.zip          | GBM_model_python_1589382591366_1.zip (mojo)  |
    | h2o-genmodel-ext-xgboost | *.java         | XGBoost_2_AutoML_20201015_144158.java        |
    | h2o-genmodel-ext-xgboost | *.zip          | XGBoost_2_AutoML_20201015_144158.zip         |
    | h2o-ext-mojo-pipeline    | *.mojo         | ...                                          |

    !!! note
        * DRUM supports models with DataRobot-generated Scoring Code and models that implement either the `IClassificationPredictor` or `IRegressionPredictor` interface from the <a target="_blank" href="https://mvnrepository.com/artifact/com.datarobot/datarobot-prediction">DataRobot-prediction library</a>. The model artifact must have a `.jar` extension.

        * You can define the `DRUM_JAVA_XMX` environment variable to set JVM maximum heap memory size (`-Xmx` java parameter): `DRUM_JAVA_XMX=512m`.

        * If you export an H2O model as `POJO`, you cannot rename the file; however, this limitation doesn't apply to models exported as `MOJO`&mdash;they may be named in any fashion.

        * The `h2o-ext-mojo-pipeline` requires an H2O Driverless AI license.

        * Support for DAI Mojo Pipeline has not been incorporated into tests for the build of `datarobot-drum`.

If your model doesn't use one of the libraries listed above, you must create an [unstructured custom model](unstructured-custom-models).

{% include 'includes/structured-vs-unstructured-cus-models.md' %}

## Structured custom model requirements {: #structured-custom-model-requirements }

If your custom model uses one of the supported libraries, make sure it meets the following requirements:

* Data sent to a model must be usable for predictions without additional pre-processing.
* Regression models must return a single floating-point value per row of prediction data.
* Binary classification models must return one floating-point value <= 1.0 or two floating-point values that sum to 1.0 per row of prediction data.
    * Single-value output is assumed to be the positive class probability.
    * For multi-value, it is assumed that the first value is the negative class probability and the second is the positive class probability.
* There must be a single `pkl`/`pth`/`h5` file present.
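
The output rules above can be checked before you upload a model. The following is a minimal sketch, assuming each prediction row is available as a plain list of floats; the function names are hypothetical and are not part of DataRobot or DRUM.

```python
import math
from typing import Sequence


def check_regression_row(row: Sequence[float]) -> bool:
    """A regression row must hold exactly one floating-point value."""
    return len(row) == 1 and isinstance(row[0], float)


def check_binary_row(row: Sequence[float], tol: float = 1e-6) -> bool:
    """A binary row holds one value <= 1.0, or two values that sum to 1.0."""
    if not all(isinstance(value, float) for value in row):
        return False
    if len(row) == 1:
        return row[0] <= 1.0
    if len(row) == 2:
        return math.isclose(row[0] + row[1], 1.0, abs_tol=tol)
    return False


def positive_class_probability(row: Sequence[float]) -> float:
    """One value is the positive class; with two values, the second is positive."""
    return row[0] if len(row) == 1 else row[1]
```

The tolerance `tol` is an assumption added to absorb floating-point rounding; the requirement itself only states that the two values sum to 1.0.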

!!! note "Data format"
    When working with structured models, DataRobot supports data as files in `csv`, `sparse`, or `arrow` format. DataRobot doesn't sanitize missing or abnormal column names (for example, names containing parentheses, slashes, or other symbols).
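
Because column names are not sanitized, it can help to inspect them yourself before scoring. The sketch below flags names that are empty or that contain anything other than letters, digits, and underscores. That definition of a "clean" name is an assumption made for illustration, and `find_suspect_columns` is a hypothetical helper.

```python
import re
from typing import List

# Assumption: a "clean" name uses only letters, digits, and underscores.
CLEAN_NAME = re.compile(r"[A-Za-z0-9_]+")


def find_suspect_columns(columns: List[str]) -> List[str]:
    """Return the column names that are empty or contain other characters."""
    return [name for name in columns if not CLEAN_NAME.fullmatch(name.strip())]
```

For example, `find_suspect_columns(["age", "price (usd)", "a/b", ""])` returns the last three names.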

## Structured custom model hooks {: #structured-custom-model-hooks }

To define a custom model using DataRobot’s framework, your artifact file should contain hooks (or functions) to define how a model is trained and how it scores new data. DataRobot automatically calls each hook and passes the parameters based on the project and blueprint configuration. However, you have full flexibility to define the logic that runs inside each hook. If necessary, you can include these hooks alongside your model artifacts in your model folder in a file called `custom.py` for Python models or `custom.R` for R models.

!!! note
    Training and inference hooks can be defined in the same file.

The following sections describe each hook, with examples.

??? note "Type annotations in hook signatures"
    The following hook signatures are written with Python 3 type annotations. The Python types match the following R types:

    Python type         | R type       | Description
    --------------------|--------------|------------
    `DataFrame`         | `data.frame` | A pandas `DataFrame` or R `data.frame`.
    `None`              | `NULL`       | Nothing
    `str`               | `character`  | String
    `Any`               | An R object  | The deserialized model.
    `*args`, `**kwargs` | `...`        | These are not types; they are placeholders for additional positional (`*args`) and keyword (`**kwargs`) parameters.


**************************************************


### `init()` {: #init }

The `init` hook is executed only once at the beginning of the run to allow the model to load libraries and additional files for use in other hooks.

``` py
init(**kwargs) -> None
```

#### `init()` input {: #init-input }

Input parameter | Description
----------------|------------
`**kwargs`      | Additional keyword arguments. `code_dir` provides a link, passed through the `--code_dir` parameter, to the folder where the model code is stored.


#### `init()` example {: #init-example }

The following provides a brief code snippet using `init()`; see a more complete example [here](https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/2_estimators/5_r_binary_classification/custom.R){ target=_blank }.

=== "Python"

    ``` py
    def init(**kwargs):
        # Keep the code directory in a global so other hooks can use it.
        global g_code_dir
        g_code_dir = kwargs.get("code_dir")
    ```

=== "R"

    ``` r
    init <- function(...) {
        library(brnn)
        library(glmnet)
    }
    ```

#### `init()` output {: #init-output }

The `init()` hook does not return anything.
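
Because `init()` returns nothing, anything it loads has to be kept where the other hooks can reach it. The following is a minimal sketch of one way to do that, using a module-level dictionary. `HOOK_STATE` and the `settings.json` file name are hypothetical and are used only for illustration.

```python
import json
import os

# Module-level state shared with the other hooks defined in the same file.
HOOK_STATE = {"code_dir": None, "settings": {}}


def init(**kwargs) -> None:
    """Remember the code directory and load an optional settings file."""
    code_dir = kwargs.get("code_dir")
    HOOK_STATE["code_dir"] = code_dir
    if code_dir is None:
        return
    # "settings.json" is a hypothetical file stored next to the model code.
    settings_path = os.path.join(code_dir, "settings.json")
    if os.path.exists(settings_path):
        with open(settings_path, encoding="utf-8") as handle:
            HOOK_STATE["settings"] = json.load(handle)
```

Other hooks in the same file can then read `HOOK_STATE["settings"]` without reloading the file on every call.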


**************************************************


### `load_model()` {: #load-model }

The `load_model()` hook is executed only once at the beginning of the run to load one or more trained objects from one or more artifacts. It is required only when a trained object is stored in an artifact that uses an unsupported format or when multiple artifacts are used. The `load_model()` hook is not required when there is a single artifact in one of the supported formats:

* Python: `.pkl`, `.pth`, `.h5`, `.joblib`
* Java: `.mojo`
* R: `.rds`

``` py
load_model(code_dir: str) -> Any
```

#### `load_model()` input {: #load-model-input }

Input parameter | Description
----------------|------------
`code_dir`      | A link, passed through the `--code_dir` parameter, to the directory where the model artifact and additional code are provided.


#### `load_model()` example {: #load-model-example }

The following provides a brief code snippet using `load_model()`; see a more complete example [here](https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/3_pipelines/14_python3_keras_joblib/custom.py){ target=_blank }.

=== "Python"

    ``` py
    import os
    import joblib

    def load_model(code_dir):
        model_path = "model.pkl"
        return joblib.load(os.path.join(code_dir, model_path))  # return the trained object
    ```


=== "R"

    ``` r
    load_model <- function(input_dir) {
        readRDS(file.path(input_dir, "model_name.rds"))
    }
    ```

#### `load_model()` output {: #load-model-output }

The `load_model()` hook returns a trained object (of any type).
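
When a model is split across several artifacts, the returned object can bundle them together. The following is a minimal sketch, assuming two pickled files with the hypothetical names `preprocessor.bin` and `estimator.bin`; it uses the standard-library `pickle` module, although any serializer would follow the same pattern.

```python
import os
import pickle
from typing import Any, Dict

# Hypothetical artifact names; the .bin extension is not a supported format,
# so DataRobot would not load these files automatically.
ARTIFACT_NAMES = {"preprocessor": "preprocessor.bin", "estimator": "estimator.bin"}


def load_model(code_dir: str) -> Dict[str, Any]:
    """Load every artifact and return them together as one trained object."""
    loaded = {}
    for key, file_name in ARTIFACT_NAMES.items():
        with open(os.path.join(code_dir, file_name), "rb") as handle:
            loaded[key] = pickle.load(handle)
    return loaded
```

The dictionary returned here is what later hooks receive as their `model` argument, so they can look up each part by key.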


**************************************************


### `read_input_data()` {: #read-input-data }

The `read_input_data` hook customizes how the model reads data; for example, with encoding and missing value handling.

``` py
read_input_data(input_binary_data: bytes) -> Any
```

#### `read_input_data()` input {: #read-input-data-input }

Input parameter     | Description
--------------------|------------
`input_binary_data` | Data passed through the `--input` parameter in `drum score` mode, or a payload submitted to the `drum server` `/predict` endpoint.


#### `read_input_data()` example {: #read-input-data-example }

=== "Python"

    ``` py
    import io
    import pandas as pd

    def read_input_data(input_binary_data):
        # Parse the raw input bytes as CSV.
        return pd.read_csv(io.BytesIO(input_binary_data))
    ```


=== "R"

    ``` r
    read_input_data <- function(input_binary_data) {
        input_text_data <- stri_conv(input_binary_data, "utf8")
        read.csv(text=gsub("\r","", input_text_data, fixed=TRUE))
    }
    ```

#### `read_input_data()` output {: #read-input-data-output }

The `read_input_data()` hook must return a pandas `DataFrame` or R `data.frame`; otherwise, you must write your own score method.
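
The encoding and missing-value handling mentioned for this hook could look like the following sketch. The `latin-1` encoding and the `?` and `unknown` markers are assumptions chosen for illustration; replace them with whatever your data actually uses.

```python
import io

import pandas as pd


def read_input_data(input_binary_data: bytes) -> pd.DataFrame:
    """Parse CSV bytes with an explicit encoding and custom missing-value markers."""
    return pd.read_csv(
        io.BytesIO(input_binary_data),
        encoding="latin-1",          # assumed encoding of the incoming data
        na_values=["?", "unknown"],  # assumed markers for missing values
    )
```

Values matching the listed markers are read as `NaN`, so downstream hooks see them as ordinary missing values.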


**************************************************


### `transform()` {: #transform }

The `transform()` hook defines the output of a custom transform and returns transformed data. This hook can be used in both transformer and estimator tasks:

* For transformers, this hook applies transformations to the data provided and passes it to downstream tasks.

* For estimators, this hook applies transformations to the prediction data before making predictions.

``` py
transform(data: DataFrame, model: Any) -> DataFrame
```

#### `transform()` input {: #transform-input }

Input parameter | Description
----------------|------------
`data`          | A pandas `DataFrame` (Python) or R `data.frame` containing the data that the custom model should transform. Missing values are indicated with `NaN` in Python and `NA` in R, unless otherwise overridden by the `read_input_data` hook.
`model`         | A trained object DataRobot loads from the artifact (typically, a trained transformer) or loaded through the `load_model` hook.

#### `transform()` example {: #transform-example }

The following provides a brief code snippet using `transform()`; see a more complete example [here](https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/1_transforms/1_python_missing_values/custom.py){ target=_blank }.


=== "Python"

    ``` py
    def transform(data, model):
        data = data.fillna(0)
        return data
    ```

=== "R"

    ``` r
    transform <- function(data, model) {
        data[is.na(data)] <- 0
        data
    }
    ```

#### `transform()` output {: #transform-output }

The `transform()` hook returns a pandas `DataFrame` or R `data.frame` with transformed data.
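
A transform can also use the trained object it receives. The following is a minimal sketch, assuming `model` is a plain dictionary that maps column names to fill values learned during training; that structure is hypothetical and depends on how your artifact was saved.

```python
import pandas as pd


def transform(data: pd.DataFrame, model: dict) -> pd.DataFrame:
    """Fill missing values using the statistics stored in the trained object."""
    # Assumption: `model` maps column names to fill values, e.g. {"age": 35.0}.
    filled = data.copy()  # leave the caller's DataFrame untouched
    for column, fill_value in model.items():
        if column in filled.columns:
            filled[column] = filled[column].fillna(fill_value)
    return filled
```

Columns that are missing from the input are skipped rather than raising an error, which is a design choice of this sketch and not a DataRobot requirement.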


**************************************************


### `score()` {: #score }

The `score()` hook defines the output of a custom estimator and returns predictions on input data. Do not use this hook for transform models.

``` py
score(data: DataFrame, model: Any, **kwargs: Dict[str, Any]) -> DataFrame
```

#### `score()` input {: #score-input }

Input parameter | Description
----------------|------------
`data`          | A pandas `DataFrame` (Python) or R `data.frame` containing the data the custom model will score. If the `transform` hook is used, `data` will be the transformed data.
`model`         | A trained object loaded from the artifact by DataRobot or loaded through the `load_model` hook.
`**kwargs`      | Additional keyword arguments. For a binary classification model, it contains the positive and negative class labels as the following keys:<ul><li>`positive_class_label`</li><li>`negative_class_label`</li></ul>


#### `score()` examples {: #score-examples }

The following provides a brief code snippet using `score()`; see a more complete example [here](https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/2_estimators/4_python_binary_classification/custom.py){ target=_blank }.


=== "Python"

    ``` py
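    from typing import Any, Dict
    import pandas as pd

    def score(data: pd.DataFrame, model: Any, **kwargs: Dict[str, Any]) -> pd.DataFrame:
        # Assumes model.predict() returns the positive class probability for each row.
        predictions = model.predict(data)
        predictions_df = pd.DataFrame(predictions, columns=[kwargs["positive_class_label"]])
        # The negative class probability is the complement of the positive one.
        predictions_df[kwargs["negative_class_label"]] = (
            1 - predictions_df[kwargs["positive_class_label"]]
        )

        return predictions_df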
    ```

=== "R"

    ``` r
    score <- function(data, model, ...){
        scores <- predict(model, newdata = data, type = "prob")
        names(scores) <- c('0', '1')
        return(scores)
    }
    ```

#### `score()` output {: #score-output }

The `score()` hook should return a pandas `DataFrame` (or R `data.frame` or `tibble`) of the following format:

* For regression or anomaly detection projects, the output must have a single numeric column named **Predictions**.

* For binary or multiclass projects, the output must have one column per class, with class names used as column names. Each cell must contain the probability of the respective class, and each row must sum up to 1.0.
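
For a regression model, the output format described above can be sketched as follows. This assumes the trained object exposes a scikit-learn-style `predict()` method that returns one value per row; that is an assumption about the model, not something DataRobot requires.

```python
from typing import Any

import pandas as pd


def score(data: pd.DataFrame, model: Any, **kwargs) -> pd.DataFrame:
    """Return regression predictions in a single column named Predictions."""
    # Assumption: predict() returns a one-dimensional sequence of numbers.
    predictions = model.predict(data)
    return pd.DataFrame({"Predictions": pd.Series(predictions, dtype="float64")})
```

Casting to `float64` keeps the column numeric even when the model returns integers.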


**************************************************


### `post_process()` {: #post-process }

The `post_process` hook formats the prediction data returned by DataRobot or the `score` hook when it doesn't match the output format expectations.

``` py
post_process(predictions: DataFrame, model: Any) -> DataFrame
```

#### `post_process()` input {: #post-process-input }

Input parameter | Description
----------------|------------
`predictions`   | A pandas `DataFrame` (Python) or R `data.frame` containing the scored data produced by DataRobot or the `score` hook.
`model`         | A trained object loaded from the artifact by DataRobot or loaded through the `load_model` hook.

#### `post_process()` example {: #post-process-example }

=== "Python"

    ``` py
    def post_process(predictions, model):
        return predictions + 1
    ```

=== "R"

    ``` r
    post_process <- function(predictions, model) {
        names(predictions) <- c('0', '1')
        predictions
    }
    ```

#### `post_process()` output {: #post-process-output }

The `post_process` hook returns a pandas `DataFrame` (or R `data.frame` or `tibble`) of the following format:

* For regression or anomaly detection projects, the output must have a single numeric column named **Predictions**.

* For binary or multiclass projects, the output must have one column per class, with class names used as column names. Each cell must contain the probability of the respective class, and each row must sum up to 1.0.
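
One case where classification scores don't match the expected format is when rows don't sum to 1.0. The following is a minimal sketch of a `post_process()` hook that repairs this by clipping each score to the range 0 to 1 and rescaling each row. It assumes every row has at least one positive score; rows that are all zero are not handled.

```python
from typing import Any

import pandas as pd


def post_process(predictions: pd.DataFrame, model: Any) -> pd.DataFrame:
    """Clip class scores to [0, 1] and rescale each row to sum to 1.0."""
    clipped = predictions.clip(lower=0.0, upper=1.0)
    row_totals = clipped.sum(axis=1)
    # Divide every column by its row total; column names are preserved.
    return clipped.div(row_totals, axis=0)
```

The column names, which must match the class names, pass through unchanged.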
